What makes a game popular?

In this project I want to find which qualities, or attributes, correlate with a game being popular. I will be using data taken from Steam, as it is the most dominant game distributor in the PC market. Steam also provides APIs that make it possible to keep detailed track of game sales and popularity.

I found this dataset on Kaggle; it uses Steam's API to track game sales and more: https://www.kaggle.com/datasets/nikdavis/steam-store-games?select=steam.csv

Here are the tools I will be using for this project:

In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.inspection import PartialDependenceDisplay

Starting with the preprocessing, I had to drop a few irrelevant columns. appid is just the ID for each game/app; I can't do anything useful with it, so I dropped it. Achievements might affect how a game is perceived, but I ended up dropping that column as well. I also found median_playtime a bit redundant, as I feel average_playtime suffices.

Then I made sure that every entry is rated and has an ownership count, since these are how I will gauge what makes a game popular. I also couldn't have duplicates of the same game, so I made sure there were no duplicate entries for the same game.

I found it far more helpful to use the percentage of a game's ratings that are positive as the game's score, formatted as a percentage. Finally, I dropped both rating columns, as the score column now does their job.

In [4]:
df = pd.read_csv('steamcharts/steam.csv')
# Drop columns that carry no useful signal
df = df.drop(columns=['appid', 'achievements', 'median_playtime'])
# Keep only entries with ratings, ownership data, and a known developer
df = df.dropna(subset=["positive_ratings", "negative_ratings", "owners", "developer"])
df = df.drop_duplicates('name')
# Score = share of positive ratings, expressed as a percentage
rating_percentage = df['positive_ratings'] / (df['positive_ratings'] + df['negative_ratings'])
df['score'] = (rating_percentage * 100).round(2).fillna(0)
df["english"] = df["english"].astype(bool)
df = df.drop(columns=['positive_ratings', 'negative_ratings'])
df = df.sort_values(by='owners', ascending=False)
df = df.reset_index(drop=True)
df
Out[4]:
name release_date english developer publisher platforms required_age categories genres steamspy_tags average_playtime owners price score
0 PLAYERUNKNOWN'S BATTLEGROUNDS 12/21/2017 True PUBG Corporation PUBG Corporation windows 0 Multi-player;Online Multi-Player;Stats Action;Adventure;Massively Multiplayer Survival;Shooter;Multiplayer 22938 50000000-100000000 26.99 50.46
1 Counter-Strike: Global Offensive 8/21/2012 True Valve;Hidden Path Entertainment Valve windows;mac;linux 0 Multi-player;Steam Achievements;Full controlle... Action;Free to Play FPS;Multiplayer;Shooter 22494 50000000-100000000 0.00 86.80
2 Infestation: The New Z 11/22/2016 True Fredaikis AB Fredaikis AB windows 0 Online Multi-Player;MMO;Online Co-op;Steam Ach... Action;Free to Play;Indie;Massively Multiplaye... Zombies;Free to Play;Survival 376 5000000-10000000 0.00 49.15
3 Cities: Skylines 3/10/2015 True Colossal Order Ltd. Paradox Interactive windows;mac;linux 0 Single-player;Steam Achievements;Steam Trading... Simulation;Strategy City Builder;Simulation;Building 3225 5000000-10000000 22.99 91.84
4 PlanetSide 2 11/20/2012 True Daybreak Game Company Daybreak Game Company windows 0 Multi-player;MMO;Steam Trading Cards Action;Free to Play;Massively Multiplayer Free to Play;Massively Multiplayer;FPS 1051 5000000-10000000 0.00 82.27
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27027 You Are God 5/19/2017 True BefuddleBug BefuddleBug windows 0 Single-player Indie;Simulation;Strategy;Early Access Early Access;Strategy;Indie 111 0-20000 3.99 25.00
27028 Tales of Terror: Crimson Dawn 3/1/2017 True Deep Shadows Games Big Fish Games windows 0 Single-player Adventure;Casual Casual;Adventure;Hidden Object 0 0-20000 4.99 0.00
27029 S-COPTER: Trials of Quick Fingers and Logic 2/12/2018 True Noble Fox Games Noble Fox Games windows 0 Single-player;Local Multi-Player;Local Co-op;S... Action;Indie Action;Indie;Puzzle 0 0-20000 5.79 100.00
27030 Apocryph: an old-school shooter 4/27/2018 True Bigzur Games Bigzur Games windows;linux 0 Single-player;Steam Achievements;Steam Trading... Violent;Gore;Action;Adventure;Indie Action;Gore;Adventure 0 0-20000 11.39 60.00
27031 Rune Lord 4/24/2019 True Adept Studios GD Alawar Entertainment windows;mac 0 Single-player;Steam Cloud Adventure;Casual;Indie Indie;Casual;Adventure 0 0-20000 5.19 100.00

27032 rows × 14 columns

I want to visualize the relationship between a game's score and its overall ownership. I will filter the dataset down to the games with the highest ownership, then categorize the scores into general reception bands and create a pie chart from them. A pie chart is an effective way to show each category's share of the whole.

In [5]:
top = df[df["owners"].isin(["50000000-100000000", "5000000-10000000"])].copy()
top["score"] = pd.to_numeric(top["score"], errors="coerce")
top = top.dropna(subset=["score"])
# Map a numeric score to a reception band; out-of-range scores return None and are dropped
def label(score):
    if score > 0 and score <= 20:
        return "Very Poor"
    elif score <= 40:
        return "Poor"
    elif score <= 60:
        return "Average"
    elif score <= 80:
        return "Good"
    elif score <= 90:
        return "Very Good"
    return None
top["reception"] = top["score"].apply(label)
top = top.dropna(subset=["reception"])
counts = top["reception"].value_counts()
plt.pie(counts, labels=counts.index, autopct=lambda pct: '{:.1f}%'.format(pct) if pct > 0 else '')
plt.title("Score distribution of the most-owned games")
plt.show()
[Pie chart: reception share of the most-owned games]

The chart suggests that the most-played games are exclusively either average or very good, which hints that game scores are good indicators of player numbers. To test this idea, I will now pit a game's ownership against its reception and see whether better-rated games are more often owned.

Since scores are quantitative data and the ownership count is categorical, a bar chart showing the inverse of this relationship wouldn't work nicely. Instead I can group by owner count, take the average score for each group, and show that relationship in a bar chart.

In [6]:
avg_scores = df.groupby("owners")["score"].mean()

avg_scores.plot(kind="bar", rot=45)
plt.title("Average game score by ownership category")
plt.ylabel("Average Score")
plt.show()
[Bar chart: average game score by ownership category]

This bar chart suggests no clear correlation between a game's rating and its owner count. It does suggest that, on average, most games in the dataset are viewed positively. This leads me to conclude that game rating isn't a strong indicator of owner count.

If I want a consensus on which attributes could predict a game with high ownership, an effective approach is to build a decision tree.

To build a classification tree we must first select our target, which is games in the two highest ownership bands (5,000,000-10,000,000 and 50,000,000-100,000,000 owners), so I need to make my target binary. Next I have to deal with some string entries in my dataset: the categories, genres, tags, and platforms columns each list multiple values separated by ';'. I convert each of these entries into integer columns using get_dummies, splitting on the ';', then aggregate these columns into the dataset (312 new columns). Now I select the feature columns to pair with the target: essentially every column except the target, name, release date, publisher, developer, and owners. The target and owners are self-explanatory; I excluded the others because they can't give me generalizable insight into what types of games are popular. Using these, I created X and y and split the data into 80% train and 20% test.
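The cell that performs this encoding and split isn't shown above; here is a minimal, self-contained sketch of the approach on a toy frame (column names match the dataset, but the rows and the reduced column list are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned Steam frame (hypothetical rows)
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "owners": ["50000000-100000000", "0-20000", "5000000-10000000", "0-20000"],
    "genres": ["Action;Indie", "Casual", "Action;Strategy", "Indie"],
    "platforms": ["windows", "windows;mac", "windows;linux", "windows"],
    "price": [26.99, 4.99, 0.0, 3.99],
    "score": [50.46, 0.0, 86.8, 25.0],
})

# Binary target: 1.0 if the game sits in one of the top ownership bands
df["target"] = df["owners"].isin(
    ["50000000-100000000", "5000000-10000000"]).astype(float)

# Split each multi-valued string column on ';' and one-hot encode the entries
# (the full dataset also has categories and steamspy_tags)
multi_cols = ["genres", "platforms"]
dummies = pd.concat([df[c].str.get_dummies(sep=";") for c in multi_cols], axis=1)
df = pd.concat([df.drop(columns=multi_cols), dummies], axis=1)

# Features exclude the target plus columns with no generalizable signal
X = df.drop(columns=["target", "name", "owners"])
y = df["target"]

# 80% train / 20% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```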

Now that we have what we need, it's time to build our decision tree classifier. I initialized and fitted the classification tree, then plotted it. Most branches classify a game as not being among the most owned; the few leaves that do indicate a top game have high Gini impurity, which implies the data there is mixed anyway. The model has an accuracy score of 99.6%, which may look great, but imbalanced data often makes this score misleading. When I checked the value counts, only 0.1776% of games classify as top played. I'm going to need a better-suited model to find what predicts something this rare.
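The cell that initializes and fits dtc isn't shown in the notebook; a minimal self-contained sketch of that step, using toy data with a deliberately rare positive class to mimic the imbalance described above (all names and values here are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy features standing in for the encoded Steam columns
rng = np.random.default_rng(42)
X_train = rng.random((200, 5))
X_test = rng.random((50, 5))
# Rare positive class (~5% here; ~0.18% in the real dataset)
y_train = (X_train[:, 0] > 0.95).astype(float)
y_test = (X_test[:, 0] > 0.95).astype(float)

# Initialize and fit the classification tree
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
pred = dtc.predict(X_test)

# With a rare positive class, accuracy mostly reflects the majority class
print(accuracy_score(y_test, pred))
```

Because the negative class dominates, even a model that never predicts the positive class scores near-perfect accuracy, which is why the 99.6% figure below needs the value-counts check.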

In [57]:
plt.figure(figsize=(75, 20), dpi=300)
tree.plot_tree(dtc, feature_names=X.columns, filled = True, rounded = True, class_names=["not top games", "top games"], fontsize=8)
plt.show()
accuracy_score(y_test, pred)
y.value_counts(normalize = True)
[Decision tree plot: most leaves classify games as "not top games"]
Out[57]:
target
0.0    0.998224
1.0    0.001776
Name: proportion, dtype: float64

Random forest classification is more accurate than a single classification tree because it is made up of several trees trained on random subsets of the data. This reduces variance, and you can train it with a balanced class weight. I will check the classification accuracy with a confusion matrix, which reveals which categories of the target it predicted correctly and incorrectly. In tandem with the matrix, I will also use a classification report to get a more detailed picture. As you can see, the gap between the majority and minority classes is so large that the minority is completely ignored. Even with the greater accuracy, a random forest is not the best fit for a set this imbalanced. Some might urge me to reevaluate how I define top games and make the definition more inclusive, but that would defeat the purpose of trying to figure out what predicts a top game: in real life, only a few games ever reach the threshold that makes them widely owned. I might need to go with another model that handles imbalanced sets better.

In [48]:
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, zero_division=0))
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     21585
         1.0       0.00      0.00      0.00        40

    accuracy                           1.00     21625
   macro avg       0.50      0.50      0.50     21625
weighted avg       1.00      1.00      1.00     21625


Logistic regression tells us the probability that a game with given attributes would be among the top games. I can set extra weight on the top games, making the model more sensitive to that outcome. One thing I did not expect was that random forests handled NaN values, while logistic regression needs something like an imputer to handle them. After training the model on our data, we see a much more truthful report. There is still a massive bias towards non-top games, but at least some top games are being accounted for. I used a coefficient plot to show which coefficients (features) were the most influential. I did not expect how influential english would be, so much so that it trounced all other features. So I tried again, this time excluding english, since English support is usually a given and not much of a distinguishing attribute.

In [62]:
num_cols = ["english", "required_age", "average_playtime", "score"]
dummy_cols = [c for c in X_train.columns if c not in num_cols]
imputer = SimpleImputer(strategy="median")
X_train_num = pd.DataFrame(imputer.fit_transform(X_train[num_cols]), columns=num_cols, index=X_train.index)
X_test_num = pd.DataFrame(imputer.transform(X_test[num_cols]), columns=num_cols, index=X_test.index)
X_train_clean = pd.concat([X_train[dummy_cols], X_train_num], axis=1).fillna(0)
X_test_clean = pd.concat([X_test[dummy_cols], X_test_num], axis=1).fillna(0)
logreg = LogisticRegression(class_weight="balanced", max_iter=1000, solver="lbfgs", random_state=42)
logreg.fit(X_train_clean, y_train)
pred = logreg.predict(X_test_clean)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
coefs = pd.Series(logreg.coef_[0], index=X_train_clean.columns)
top_features = coefs.abs().sort_values(ascending=False).head(15).index
coefs_top = coefs.loc[top_features].sort_values()

plt.figure(figsize=(10,6))
coefs_top.plot(kind="barh", color=np.where(coefs_top>0, "green", "red"))
plt.title("Top Features Influencing 'Top Games'")
plt.xlabel("Coefficient (log-odds impact)")
plt.show()

coefs = coefs.drop("english", errors="ignore")

top_features = coefs.abs().sort_values(ascending=False).head(15).index
coefs_top = coefs.loc[top_features].sort_values()
plt.figure(figsize=(10,6))
coefs_top.plot(kind="barh", color=np.where(coefs_top>0, "green", "red"))
plt.title("Top Features Influencing 'Top Games' (without 'english')")
plt.xlabel("Coefficient (log-odds impact)")
plt.show()
[[5185  212]
 [   3    7]]
              precision    recall  f1-score   support

         0.0       1.00      0.96      0.98      5397
         1.0       0.03      0.70      0.06        10

    accuracy                           0.96      5407
   macro avg       0.52      0.83      0.52      5407
weighted avg       1.00      0.96      0.98      5407

[Coefficient plot: top features influencing 'Top Games']
[Coefficient plot: top features influencing 'Top Games', excluding 'english']

Now that I know which features are the most influential, I will use a Partial Dependence Plot to visualize at which thresholds these features relate to a top game. Notice that none of these features crosses the 50% threshold, meaning that none of them individually predicts that a game will be a top game. Instead, it is a combination of these features that predicts a game being among the most owned. If you're looking to predict what qualities make a game widely owned, at least according to Steam, look for a mature, addicting, critically adored game!

In [71]:
logreg.fit(X_train_clean, y_train) 
PartialDependenceDisplay.from_estimator(
logreg, X_train_clean, features=["score", "required_age", "average_playtime"], kind="average", grid_resolution=50)
plt.show()
[Partial dependence plots for score, required_age, and average_playtime]

This project shows how data can reveal what really drives a game’s popularity on Steam. By identifying the features that most influence ownership, it highlights how critical design choices like playtime, score, and audience rating all combine to shape success. The analysis also shows that no single factor guarantees a top game, but together they create patterns that can help developers and publishers make smarter decisions. This has value not only for game studios trying to understand their audience, but also for players curious about what sets their favorite games apart. Ultimately, it demonstrates how data science can uncover insights in gaming that go beyond opinion or guesswork.

Sources

This project ballooned beyond my range of competency, so apart from what I learned in class I used a lot of external sources too.

  • https://scikit-learn.org/stable/modules/partial_dependence.html
  • https://stackoverflow.com/questions/52458426/random-forests-interpretability/52472017
  • https://www.youtube.com/watch?v=rijqfllOq6g
  • https://www.youtube.com/watch?v=SjOfbbfI2qY&t=1s
  • https://www.youtube.com/watch?v=_L39rN6gz7Y&t=290s
  • https://www.reddit.com/r/datascience/comments/6k4b3j/logistic_regression_v_random_forest/
  • ChatGPT